Combining Auditory Inspirations and Hierarchical Feature Extraction for Robust Speech Recognition
Authors
Abstract
We present speech features inspired by the processing in the auditory periphery and by the receptive fields found in the auditory cortex. They have a hierarchical organization and jointly evaluate variations in the spectro-temporal domain, which is why we termed them Hierarchical Spectro-Temporal (HIST) features. To calculate them, we apply a Gammatone filterbank to transform the signal into the spectral domain. A preprocessing stage based on local competition mechanisms then enhances the formants in the spectrogram. A set of filters learned via Independent Component Analysis (ICA) captures local variations in the spectrogram and constitutes the first layer of the hierarchy. In the second layer these local variations are combined to form larger receptive fields learned via Non-Negative Sparse Coding. The dimensionality of the resulting features is reduced via Principal Component Analysis (PCA), and the reduced features are then fed into a Hidden Markov Model (HMM). We evaluated the performance of these features on a continuous digit recognition task in a variety of noise conditions, similar to the Aurora task. Our results show a significant performance improvement in noise, especially in combination with RASTA features.
Introduction
Human speech perception has long served as a role model in the development of machine recognition (e.g. RASTA-PLP [1]). Here, we present features which take their inspiration not from psychoacoustic but from neurophysiological data. Shamma showed that the primary auditory cortex of young ferrets has a spectro-temporal organization, i.e. the receptive fields are selective to modulations in the time-frequency domain and, as in the visual cortex, have Gabor-like shapes [2]. Traditionally, however, speech features rely only on spectral representations. Spectro-temporal features have nevertheless already been used for speech recognition [3, 4, 5], speech detection [6, 7], and source separation [8].
Justified by the analogies found between the visual and auditory cortex in mammals, we developed speech features closely modeled on the visual object recognition system described in [9]. Its main features are the hierarchical organization in three layers and the unsupervised learning of the receptive fields on the first and second layer. We termed the resulting speech features Hierarchical Spectro-Temporal (HIST) features and used them as a front-end to Hidden Markov Models (HMMs) [10, 11]. In this paper we report further improvements of these features and tests on a continuous digit recognition task.
Figure 1: Overview of the feature extraction process.
In the following section, the computation and enhancement of the spectrograms are described. The calculation of the HIST features from the spectrograms is explained in the section after that (see Fig. 1 for an overview of the process). Finally, the performance of the HIST features is evaluated, especially with respect to RASTA-PLP features, in the penultimate section.
Preprocessing
The spectrograms of the speech signals were computed using a Gammatone filter-bank. We used an Infinite Impulse Response (IIR) implementation of the Gammatone filter-bank [?] with 128 channels ranging from 80 Hz to 8 kHz at a sampling rate of 16 kHz. The spectrograms are obtained by rectification and low-pass filtering of the filter-bank response. The sampling rate of the spectrograms was then reduced to 400 Hz.
Formant enhancement
The remaining preprocessing steps enhance the formants in the spectrogram. A preemphasis of +6 dB/oct. compensated for the influence of the speech excitation signal. Next, we used a set of Mexican Hat filters along the frequency axis to remove the harmonic structure of the spectrograms and form peaks at the formant locations. The size of the filter kernels was chosen constant on a linear frequency axis.
Because the center frequencies of the Gammatone filterbank are arranged logarithmically, the size of the kernels in the implementation varied accordingly. Additionally, the shapes of the filters were adapted to the nonlinear frequency spacing, i.e. the lower part of the filter is wider than the higher part. A second Mexican Hat filter with smaller kernel sizes for lower frequencies thinned the resulting formant tracks. Figure 2 shows the original spectrogram and the result of the formant enhancement of the digit "one" spoken by a male ...
[Figure 2, panel (a) and a second panel: Time [s] (0 to 0.4) versus Frequency [kHz] (0.1 to 8)]
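The formant-enhancement steps described under Preprocessing can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes purely geometric spacing of the 128 center frequencies between 80 Hz and 8 kHz, models the +6 dB/oct. preemphasis as an amplitude gain proportional to frequency, and uses a single Mexican Hat kernel of fixed width on the channel axis, whereas the text describes kernels whose size and shape vary with frequency. The function names, the reference frequency `ref_hz`, the kernel parameters, and the half-wave rectification of the filter output are all hypothetical choices made for illustration.

```python
import math


def center_frequencies(n_channels=128, f_low=80.0, f_high=8000.0):
    """Geometrically (log-) spaced center frequencies in Hz (assumed spacing)."""
    ratio = f_high / f_low
    return [f_low * ratio ** (i / (n_channels - 1)) for i in range(n_channels)]


def preemphasis_gain(freq_hz, ref_hz=1000.0):
    """Amplitude gain proportional to frequency, i.e. about +6 dB per octave.

    ref_hz is a hypothetical reference at which the gain equals 1.
    """
    return freq_hz / ref_hz


def mexican_hat_kernel(half_width, sigma):
    """Mexican Hat (Ricker) kernel sampled at integer channel offsets.

    The mean is subtracted so that a flat spectrum maps to zero.
    """
    taps = []
    for k in range(-half_width, half_width + 1):
        u = (k / sigma) ** 2
        taps.append((1.0 - u) * math.exp(-u / 2.0))
    mean = sum(taps) / len(taps)
    return [t - mean for t in taps]


def enhance_frame(frame, kernel):
    """Filter one spectrogram frame along the frequency (channel) axis.

    Edge channels are replicated, and negative responses are clipped to
    zero (an assumption) so that only spectral peaks remain.
    """
    half = len(kernel) // 2
    n = len(frame)
    out = []
    for c in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(c + j - half, 0), n - 1)
            acc += w * frame[idx]
        out.append(max(acc, 0.0))
    return out
```

A frame would first be scaled channel by channel with `preemphasis_gain(f)` for each center frequency `f`, and then passed through `enhance_frame` with a kernel such as `mexican_hat_kernel(6, 2.0)`. A second pass with a narrower kernel would correspond to the thinning step mentioned in the text.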
Similar resources
Phoneme Classification Using Temporal Tracking of Speech Clusters in Spectro-temporal Domain
This article presents a new feature extraction technique based on the temporal tracking of clusters in the spectro-temporal feature space. In the proposed method, auditory cortical outputs were clustered. The attributes of speech clusters were extracted as secondary features. However, the shape and position of speech clusters change over time. The clusters were temporally tracked and temporal tra...
A high-performance auditory feature for robust speech recognition
An auditory feature extraction algorithm for robust speech recognition in adverse acoustic environments is proposed. Based on an analysis of the human auditory system, the feature extraction algorithm consists of several modules: FFT, outer-middle-ear transfer function, frequency conversion from linear to Bark scales, auditory filtering, nonlinearity, and discrete cosine transform. Three recogniti...
A hierarchical framework for spectro-temporal feature extraction
In this paper we present a hierarchical framework for the extraction of spectro-temporal acoustic features. The design of the features targets higher robustness in dynamic environments. Motivated by the large gap between human and machine performance in such conditions we take inspirations from the organization of the mammalian auditory cortex in the design of our features. This includes the jo...
Robust Auditory-Based Speech Feature Extraction Using Independent Subspace Method
In recent years many approaches have been developed to address the problem of robust speaker recognition in adverse acoustical environments. In this paper we propose a robust auditory-based feature extraction method for speaker recognition according to the characteristics of the auditory periphery and cochlear nucleus. First, speech signals are represented based on frequency selectivity at basi...
Robust distributed speech recognition in noise and packet loss conditions
This paper examines the performance of a Distributed Speech Recognition (DSR) system in the presence of both background noise and packet loss. Recognition performance is examined for feature vectors extracted from speech using a physiologically-based auditory model, as an alternative to the more commonly-used Mel Frequency Cepstral Coefficient (MFCC) front-...
An improved model of masking effects for robust speech recognition system
The performance of an automatic speech recognition system drops dramatically in the presence of background noise, unlike the human auditory system, which is more adept at recognizing noisy speech. This paper proposes a novel auditory modeling algorithm which is integrated into the feature extraction front-end for a Hidden Markov Model (HMM). The proposed algorithm is named LTFC which simulates properti...